======================================================================================

Udacity - Data Analyst ND Term-2 Project-2, Feb & Mar 2019 by: James C Walmsley

Purpose: Produce a simple linear model to predict the likelihood of a potential loan by the Prosper Loan Company becoming delinquent.


## [1] "Fri Mar  8 12:53:11 2019"

Install packages / libraries / load data

Data Summary

The raw data has 113,937 rows and 81 columns. It contains information on each loan made by the Prosper Company of San Francisco from Q4-2005 until Q1-2014.

Input data provides specific information about: loan amount, borrower rate (or interest rate), borrower APR, current loan status, borrower income, borrower employment status and duration, borrower credit history, and the latest payment information.

## [1] 113937     81

Based on the goals of this analysis, 54 of the original columns were retained and the rest dropped. Where possible, the remaining column names were shortened.


Univariate Plots Section

The table below is a six statistic summary of the ‘BorrowerAPR’ column.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25

The following four barplots of the ‘BorrowerAPR’ column show frequency on the y axis and Borrower’s APR on the x axis.

The first plot has a binwidth of 0.05.

In the next three plots the binwidth was gradually decreased. Smaller binwidths show the extent to which Prosper uses fine increments in interest rates.

A high frequency of loans are made at a rate of roughly 0.365 (36.5 percent); however, this is not visible when the binwidth is 0.05.

With a binwidth of 0.01 rather than 0.05, the hidden spike in loan frequency at roughly the 0.365 rate becomes visible.

Decreasing the binwidth even further shows that the increments between the interest rates charged by the Prosper company are implausibly fine.

My conclusion is that these interest rates were randomly generated by the individual or individuals providing the data and are not true interest rates being charged by banks: there are 6677 different rates, most of them differing by only 0.00001, which is extremely unlikely.

Had I not explored using different binwidths I probably would not have discovered this.
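The effect of the binwidth can also be checked numerically. The following is a minimal sketch, not the code used for the plots above: the vector `rates`, the helper functions `count_bins` and `spike_ratio`, and the location of the cluster are all illustrative assumptions built on made-up data rather than the Prosper data.

```r
# Illustrative data only: a smooth spread of rates plus a cluster at one value.
set.seed(42)
rates <- c(runif(500, min = 0.05, max = 0.45), rep(0.365, 60))

# Count observations per bin for a given binwidth (no plot is drawn).
count_bins <- function(x, binwidth) {
  breaks <- seq(0, 0.55, by = binwidth)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

wide   <- count_bins(rates, 0.05)
narrow <- count_bins(rates, 0.01)

# Ratio of the tallest bin to the median non-empty bin: a rough measure of
# how strongly the cluster stands out at each binwidth.
spike_ratio <- function(counts) max(counts) / median(counts[counts > 0])
spike_ratio(wide)
spike_ratio(narrow)

# Number of distinct rate values, the kind of check that exposes very fine
# increments between rates.
length(unique(rates))
```

With the narrower bins the cluster dominates its neighbors much more clearly, which matches what the histograms show visually.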

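The six statistic summary below of ‘BorrowerRate’ (interest rate) shows that three quarters of the interest rates lie between 0.134 and 0.4975, that is, between the first quartile and the maximum. The rates are expressed as decimal fractions, not percentages.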

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

The plot of ‘BorrowerRate’ below, using a binwidth of 0.001, increases the granularity visible in the dispersion of interest rate values.

The next plot below displays the ‘BorrowerRate’ distribution as a density for comparison to the bar plot above.

This density curve shows the relative percentage of total on the y axis rather than count as in the plot above.

The concentration of loans around the rate of 0.14 and the spike in frequency at the 0.325 rate level are noticeable.

The six statistic summary of ‘LoanTerm’ below indicates that the median loan term is 36 months, with a minimum of 12 months and a maximum of 60 months.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   40.83   36.00   60.00

The histogram of ‘LoanTerm’ below identifies three loan term lengths (12, 36, or 60 months), the most frequent being 36 months, visually repeating the results of the summary statistics above.

The table of ‘BorrowerState’ below shows counts of loans by borrower state.

## 
##          AK    AL    AR    AZ    CA    CO    CT    DC    DE    FL    GA 
##  5515   200  1679   855  1901 14717  2210  1627   382   300  6720  5008 
##    HI    IA    ID    IL    IN    KS    KY    LA    MA    MD    ME    MI 
##   409   186   599  5921  2078  1062   983   954  2242  2821   101  3593 
##    MN    MO    MS    MT    NC    ND    NE    NH    NJ    NM    NV    NY 
##  2318  2615   787   330  3084    52   674   551  3097   472  1090  6729 
##    OH    OK    OR    PA    RI    SC    SD    TN    TX    UT    VA    VT 
##  4197   971  1817  2972   435  1122   189  1737  6842   877  3278   207 
##    WA    WI    WV    WY 
##  3048  1842   391   150

The following barplot shows the count of loans by state on the y axis and lists each state on the x axis.

The first bar in the lower left of the graph shows over 5 thousand borrowers had not indicated which state they were located in.

The following table of ‘EmploymentStatus’ gives counts for each employment group.

##                    Employed     Full-time Not available  Not employed 
##          2255         67322         26355          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3806          1088           795          6134

The following barchart helps visualize the counts of the different employment groups Prosper is making loans to.

Columns 1, 4 & 6 contain 11,408 loans without a clearly defined employment type.

Perhaps these loans weren’t made to individuals but to some kind of business or organization.

The table below of ‘HomeMortgage’ tells us the number of borrowers who have a home mortgage and the number of those who don’t have one.

It’s clear from these counts that Prosper makes loans to those with and those without a home mortgage at about the same frequency.

## False  True 
## 56459 57478

The following table on ‘IncomeRange’ shows the counts of borrowers in each grouping.

Two of the values, accounting for 8,547 loans, don’t represent income ranges.

The columns are not in a sensible order, such as increasing from left to right.

##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

After converting the ‘IncomeRange’ variable into an ordered factor we re-plot it to create a more accurate visualization of the actual income distribution for the rows containing valid income ranges.
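One way such a conversion can be done is sketched below. The counts are copied from the ‘IncomeRange’ table above; the names `income_counts`, `income_levels`, `income_range`, and `income_ordered` are illustrative and not taken from the original code.

```r
# Counts copied from the 'IncomeRange' table above.
income_counts <- c("$0" = 621, "$1-24,999" = 7274, "$100,000+" = 17337,
                   "$25,000-49,999" = 32192, "$50,000-74,999" = 31050,
                   "$75,000-99,999" = 16916, "Not displayed" = 7741,
                   "Not employed" = 806)

# Desired left-to-right order: monetary ranges ascending.
income_levels <- c("$0", "$1-24,999", "$25,000-49,999", "$50,000-74,999",
                   "$75,000-99,999", "$100,000+")

# Rebuild one value per loan, then convert to an ordered factor.
# Values not listed in `levels` ("Not displayed", "Not employed") become NA.
income_range   <- rep(names(income_counts), times = income_counts)
income_ordered <- factor(income_range, levels = income_levels, ordered = TRUE)

table(income_ordered)        # counts now appear in ascending income order
sum(is.na(income_ordered))   # loans without a valid income range
```

Leaving the two non-monetary values out of `levels` is a design choice in this sketch: it turns them into NA so that only valid income ranges are plotted.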

The table below divides ‘IncomeVerifiable’ into the counts of yes and no values.

##  False   True 
##   8669 105268

The barplot of ‘IncomeVerifiable’ below shows the same difference in yes and no counts visually.

A visual comparison seems to provide a more meaningful message.

The following table of ‘LoanOriginationQuarter’ is an unordered count of loans made by the Prosper company during each quarter.

## 
##    Q1    Q2    Q3    Q4 
## 29678 24906 27967 31386

The following plot shows the count of loans made by Prosper each quarter.

The table below provides the number of loans originated by year.

## 
##  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014 
##    22  5906 11460 11552  2047  5652 11228 19553 34345 12172

The following barplot of “OriginationYear” displays counts of loans originated yearly by Prosper.

Added a new column called ‘LoanCode’ to the data frame uni_data that replaces the loan category values, which are currently integers, with human-readable strings.

After producing a summary table of the ‘LoanCategory’ I changed the data type of ‘LoanCategory’ from integer to factor creating a different table showing the counts for each loan category.

##      Not-Available Debt-Consolidation   Home-Improvement 
##              16965              58308               7433 
##           Business      Personal-Loan        Student Use 
##               7189               2395                756 
##               Auto              Other      Baby-Adoption 
##               2572              10494                199 
##               Boat Cosmetic-Procedure    Engagement-Ring 
##                 85                 91                217 
##        Green Loans Household-Expenses    Large Purchases 
##                 59               1996                876 
##   Medical-/-Dental         Motorcycle                 RV 
##               1522                304                 52 
##             Taxes            Vacation      Wedding Loans 
##                885                768                771
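The recoding from integer categories to readable labels can be sketched as follows. This is an assumption-laden illustration: `loan_category` is a made-up sample, and the code-to-label mapping in `code_labels` covers only a few codes, assumed to follow the order of the table above; the full mapping should be taken from the data dictionary.

```r
# Hypothetical sample of integer category codes (not the real column).
loan_category <- c(1, 0, 2, 1, 3, 1, 7)

# Assumed code-to-label mapping for a few categories, following the order
# of the table above.
code_labels <- c("0" = "Not-Available", "1" = "Debt-Consolidation",
                 "2" = "Home-Improvement", "3" = "Business", "7" = "Other")

# Look up each code by name; codes missing from the mapping would yield NA.
loan_code <- factor(unname(code_labels[as.character(loan_category)]),
                    levels = unique(code_labels))

table(loan_code)
```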

The bar plot below shows the ‘LoanCode’ counts for the 20 categories of loans the Prosper company makes, plus the ‘Not-Available’ group, consolidation loans being the most frequent.

The table below summarizes the counts of the eight credit grades.

##           A    AA     B     C     D     E    HR    NC 
## 84984  3315  3509  4389  5649  5153  3289  3508   141

The bar plot below displays the distribution of the eight ‘CreditGrades’, along with a group of roughly 85K loans that have no credit grade.

These credit grades only apply to loans prior to the year 2009; therefore this variable is not consistent across the data set and should be used with caution in any calculations.

The summary table below shows counts of the 12 ‘LoanStatus’ categories.

##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

The barchart below shows the loan frequencies of each ‘LoanStatus’ category.

The table below summarizes the number of CurrentDelinquencies.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.5921  0.0000 83.0000     697

The histogram below shows the frequency of the ‘LoanDaysDelinquent’ variable.

The following cell displays the results of the summary function, producing a six statistic table for each of CreditScoreRangeLower and CreditScoreRangeUpper.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   679.0   699.0   704.6   739.0   899.0     591

Added a ‘CreditScoreMean’ column by calculating the mean of ‘CreditScoreRangeLower’ and ‘CreditScoreRangeUpper’, that is, the midpoint of each credit score range. The following plot shows the mean of the added ‘CreditScoreMean’ column as a dark red dashed vertical line.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    10.0   670.0   690.0   695.6   730.0   890.0     591
##     Mean 
## 695.5677
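A midpoint column of this kind can be computed row by row, as in the minimal sketch below. The data frame `scores` is hypothetical. The summary above reports whole numbers (670, 690), so the original code presumably rounded the result; this sketch does not.

```r
# Hypothetical rows; NA marks a missing credit score range.
scores <- data.frame(
  CreditScoreRangeLower = c(660, 680, 720, NA),
  CreditScoreRangeUpper = c(679, 699, 739, NA)
)

# Row-wise mean of the two bounds, i.e. the midpoint of each range.
scores$CreditScoreMean <- rowMeans(
  scores[, c("CreditScoreRangeLower", "CreditScoreRangeUpper")]
)

# Column mean, skipping missing values (the value a dashed line would mark).
mean(scores$CreditScoreMean, na.rm = TRUE)
```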

## [1] "Fri Mar  8 12:53:58 2019"










Univariate Analysis

What is the structure of your dataset?

The initial raw data set contained 113,937 observations (rows) and 81 variables (columns) on each loan, such as: loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information, to name a few.

In the early stage of this analysis twenty seven columns were dropped, bringing the number of columns at this stage of the analysis to 54.

The cell below describes the structure of the cleaned data frame that was used in the univariate analysis.

## [1] 113937     59

What is/are the main feature(s) of interest in your dataset?

Features relating to the five C’s of Credit analysis: (capacity, capital, conditions, character & collateral).

Because most of the loans Prosper makes are consolidation loans and not home loans, variables representing collateral are not present in this data.

  • LoanStatus [4] Condition
  • Term [3] Condition
  • LoanCategory [11] Condition
  • BorrowerState [12] Condition
  • Occupation [13] Capacity
  • EmploymentStatus [14] Capacity
  • EmploymentDuration [15] Capacity
  • HomeMortgage [16] Capacity
  • OverdueLast7Years [30] Character
  • DebtToIncomeRatio [34] Capacity
  • IncomeRange [35] Capacity
  • IncomeVerifiable [36] Character
  • StatedMonthlyIncome [37] Capacity
  • LoanOriginationDate [47] Condition
  • MonthlyLoanPayment [49] Capacity
  • CreditScoreMean [55] Character

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Supporting features of interest

  • CreditScoreLower [18] Character
  • CreditScoreUpper [19] Character
  • TotalInquiries [27] Character
  • CurrentDelinquencies [28] Character
  • AmountDelinquent [29] Character
  • LoanOriginalAmount [46] Condition
  • InvestorsFriendsCount [52] Character
  • FriendsAmountInvested [49] Character
  • Investors [54] Capacity

Did you create any new variables from existing variables in the dataset?

  • Columns 26 & 27 were used to create the ‘CreditScoreMean’ column (55) in uni_data so that the standard deviation of ‘CreditScoreMean’ could be calculated, enabling analysis of the spread in credit score means as they relate to the 5 C’s.

  • Added the ‘LoanCode’ column, changing the loan categories from numeric values to string values in order to generate meaningful axis labels on the plots. This brought the total column count up to 56.

  • Added ‘OriginationQuarter’ & ‘OriginationYear’ columns by extracting two strings from the ‘LoanOriginationQuarter’ column, bringing the column count up to 58.

  • Added ‘IncomeRange_ordered’ to rearrange the values in the ‘IncomeRange’ column in ascending order for those levels that have monetary values, bringing the column count up to 59.
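Extracting a quarter and a year from a combined string can be sketched as below. The input format ("Q4 2005", quarter then year separated by a space) is an assumption, and the variable names are illustrative rather than the ones used in the original code.

```r
# Hypothetical values in the assumed "Q<quarter> <year>" format.
loan_origination_quarter <- c("Q4 2005", "Q1 2014", "Q3 2010")

# Split each string on the space and take the first and second pieces.
parts <- strsplit(loan_origination_quarter, " ", fixed = TRUE)
origination_quarter <- vapply(parts, `[`, character(1), 1)
origination_year    <- as.integer(vapply(parts, `[`, character(1), 2))

table(origination_quarter)
range(origination_year)
```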

Of the features you investigated, were there any unusual distributions?

The BorrowerAPR was very unusual in that, among the more than 113,000 loans in this data set, there were 6677 different interest rates, most separated by a difference of just 0.00001.

It would be unrealistic for a bank to be able to apply this many different rates to its pool of borrowers. Without performing this exploratory analysis I wouldn’t have stumbled upon this finding.

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Columns Modified

  • Used the year function from the lubridate package to capture the year from the date field, which enabled plotting of the data in the “LoanOriginationDate” column.

  • This transformation was done to enable plotting of time data.

Ordered factor variables in the ‘IncomeRange’ column improving plot readability.

  • Reordered the income factors to plot the income ranges in logical sequence.

Rotated the x-axis tick labels in three plots to improve the display of those labels:

  • EmploymentStatus

  • IncomeRange

  • LoanCode (replaced the LoanCategory column for plotting this data)

Columns dropped:

  • columns 1,2 (listing numbers aren’t any of the 5 C’s of loan risk analysis: Capital, Capacity, Conditions, Character or Collateral)

  • columns 10:16 (these columns pertain to business efficiency and profitability but give little information on any of the 5 C’s of loan risk analysis)

  • columns 23:24 (insufficient information is available regarding these values)

  • columns 39:40 (relate to public records but we don’t know the significance here)

  • column 46 (not sure what kinds of Trades these are and how they relate to the 5 C’s)

  • columns 52:58 (each column had roughly 90K NAs or missing data, so only about 20% of the observations provide information on these features)

  • columns 59 & 61 (each column has over 95K NAs, same as above)

  • columns 69:72 (columns relate to costs and profitability of Prosper’s business operations without giving information relevant to the 5 C’s)
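Screening columns by their share of missing values, as was done for the sparse columns above, can be sketched like this. The data frame `loans`, its column names, and the 75% threshold are hypothetical choices for illustration.

```r
# Hypothetical data frame with one sparse column.
loans <- data.frame(
  BorrowerRate = c(0.13, 0.18, 0.25, 0.20, 0.31),
  SparseColumn = c(NA, NA, 4, NA, NA)
)

# Share of missing values in each column.
na_share <- colMeans(is.na(loans))
na_share

# Keep only columns below an assumed 75% missing-value threshold.
kept <- loans[, na_share < 0.75, drop = FALSE]
names(kept)
```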






Bivariate Plots Section

Rename (copy) uni_data to bi_data for further bivariate analysis

The data frame above was renamed to bi_data and will be used for the bivariate section of analysis.

The plot below is a pairs plot made with the ggpairs function from the GGally package (which builds on ggplot2) to show whether or not the two variables are correlated.

Bivariate Plot 1 analysis

BorrowerRate vs. CurrentDelinquencies were the two variables in the first bivariate plot.

The results show a correlation coefficient of 0.177, indicating a weak positive relationship.
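The coefficient reported in a pairs plot is a Pearson correlation, which can also be computed directly. The sketch below uses made-up values, not the Prosper data, and shows how missing values (present in ‘CurrentDelinquencies’) can be handled.

```r
# Hypothetical values; NA mimics the missing entries in 'CurrentDelinquencies'.
borrower_rate         <- c(0.10, 0.15, 0.20, 0.25, 0.30, 0.35)
current_delinquencies <- c(0,    0,    1,    NA,   0,    3)

# Pearson correlation using only rows where both values are present.
r <- cor(borrower_rate, current_delinquencies, use = "complete.obs")
r

# cor.test reports the same coefficient along with a p-value and a
# confidence interval, which helps judge whether a weak correlation matters.
ct <- cor.test(borrower_rate, current_delinquencies)
ct$estimate
ct$p.value
```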


Bivariate plot 2 below uses the ggpairs function to display the correlation between “LoanDurationMonths” and “CurrentDelinquencies”, applying a smoothing function and a facet arrangement of the plots, with “CurrentDelinquencies” mapped to the color aesthetic.

Bivariate Plot 2 analysis

LoanDurationMonths vs. CurrentDelinquencies were the two variables plotted in the second bivariate plot, producing a weak correlation coefficient of 0.248.


Bivariate plot 3 below uses the ggpairs function to display the relationship between “EmploymentStatus” and “CurrentDelinquencies”, applying a smoothing function and a facet arrangement of the plots, with “CurrentDelinquencies” mapped to the color aesthetic.

Bivariate Plot 3 analysis

EmploymentStatus vs. CurrentDelinquencies were used in the third bivariate plot. In this plot a correlation coefficient was not returned because one of the inputs is non-numeric.


Bivariate plot 4 below uses the ggpairs function to display the relationship between “IncomeVerifiable” and “CurrentDelinquencies”, applying a smoothing function and a facet arrangement of the plots, with “CurrentDelinquencies” mapped to the color aesthetic.

Bivariate Plot 4 analysis

IncomeVerifiable vs. CurrentDelinquencies were plotted with the pairs plot function. Although the number of delinquencies is lower for borrowers with non-verifiable income, this is most likely because lenders simply do not make as many loans to entities without knowing in advance whether the borrower is capable of repaying the loan.

An entity with extra cash lying around probably wouldn’t need a loan. So although verified income might not be useful in determining which loans will be repaid, it can help in the decision to make the loan in the first place.


Bivariate plot 5 below uses the ggpairs function to display the relationship between “IncomeRange_ordered” and “CurrentDelinquencies”, applying a smoothing function and a facet arrangement of the plots, with “CurrentDelinquencies” mapped to the color aesthetic.

Bivariate Plot 5 analysis

IncomeRange_ordered vs. CurrentDelinquencies were plotted in the 5th bivariate plot. This plot visibly shows that the higher the income range, the lower the CurrentDelinquencies rate.


Bivariate plot 6 below uses ggpairs to display the correlation between “OverdueLast7Years” and “CurrentDelinquencies”, applying a smoothing function and a facet arrangement, with “CurrentDelinquencies” mapped to the color aesthetic.

Bivariate Plot 6 analysis

OverdueLast7Years vs. CurrentDelinquencies were plotted in the sixth bivariate plot, which returned a correlation coefficient of 0.378, indicating a moderate positive relationship between these two variables.


Bivariate plot 7 below is a pairs plot displaying the correlation between “CreditScoreMean” and “CurrentDelinquencies”, applying a smoothing function and a facet arrangement, with “CurrentDelinquencies” mapped to the color aesthetic.

Bivariate Plot 7 analysis

CreditScoreMean vs. CurrentDelinquencies were plotted in the seventh bivariate plot.

It turns out that there is a moderate negative correlation between them, calculated to be -0.368.

In other words, as the ‘CreditScoreMean’ increases, the ‘CurrentDelinquencies’ rate decreases.

This explains why a Credit Score is such a key component of loan risk analysis.


Bivariate plot 8 below is a pairs plot displaying the correlation between “FriendsAmountInvested” and “CurrentDelinquencies”, applying a smoothing function and a facet arrangement, with “CurrentDelinquencies” mapped to the color aesthetic.

Bivariate Plot 8 analysis

‘FriendsAmountInvested’ vs. ‘CurrentDelinquencies’ were calculated to have a correlation coefficient of 0.0153, which is not considered significant for this relationship.


Bivariate plot 9 below is a pairs plot displaying the correlation between “TotalInquiries” and “Investors”, applying a smoothing function and a facet arrangement, with “Investors” mapped to the color aesthetic.

Bivariate Plot 9 analysis

‘TotalInquiries’ vs. ‘Investors’ were plotted using the ggpairs function from the GGally package and returned a correlation coefficient of 0.0263, meaning little correlation is present between these two variables.


Bivariate plot 10 below is a pairs plot displaying the correlation between “FriendsAmountInvested” and “DebtToIncomeRatio”, applying a smoothing function and a facet arrangement, with “FriendsAmountInvested” mapped to the color aesthetic as in earlier plots.

Bivariate Plot 10 analysis

‘FriendsAmountInvested’ vs. ‘DebtToIncomeRatio’ were plotted to visually display the relationship between these two variables. The correlation coefficient of 0.0279 does not indicate any significant correlation between the two.


Bivariate plot 11 below is another pairs plot displaying the correlation between “CreditScoreLower” and “LoanOriginalAmount”, applying a smoothing function and a facet arrangement, with “CreditScoreLower” mapped to the color aesthetic as was done in earlier plots.

Bivariate Plot 11 analysis

‘CreditScoreLower’ vs. ‘LoanOriginalAmount’ were the two variables used in this bivariate plot, producing a correlation coefficient of 0.341, meaning the relationship between these two variables is at the lower end of a moderate correlation.










Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Bivariate Plot Analysis Summary

The three highest correlation coefficients were ‘OverdueLast7Years’ at 0.378 and ‘CreditScoreMean’ at -0.368 (both with ‘CurrentDelinquencies’), and finally ‘CreditScoreLower’ at 0.341 (with ‘LoanOriginalAmount’).

Among the remaining pairs, ‘LoanDurationMonths’ (0.248) and ‘BorrowerRate’ (0.177) showed weak correlations with ‘CurrentDelinquencies’, while ‘FriendsAmountInvested’ showed almost none (0.0153).

Analyzing these results, we can conclude that a person’s past credit performance provides many of the characteristics that lend themselves to predicting future loan outcomes; any influence from friends’ support appears small at this stage.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The bivariate plot of ‘CreditScoreLower’ vs. ‘LoanOriginalAmount’ calculates a correlation coefficient of 0.341 which was one of the top three strongest correlations discovered thus far.

What was the strongest relationship you found?

The strongest relationship between two variables I found thus far was between the ‘OverdueLast7Years’ and ‘CurrentDelinquencies’ columns. The correlation coefficient was calculated to be 0.378.










Multivariate Plots Section

Renamed (copied bi_data set) to multi_data for further multivariate analysis

Eleven multivariate plots considering different relationships between three or more variables will follow.

Each of these variables was analyzed above to determine its relationship with one other variable (a bivariate analysis).

I’ll be using colors, sizes and shapes to show how these additional variables relate to each of the first two variables.

In the following plots I’ll be looking at relationships between: CreditScoreMean, CurrentDelinquencies, HomeMortgage, IncomeVerifiable, Term, LoanCode, DebtToIncomeRatio, BorrowerAPR, IncomeRange_ordered, InvestorsFriendsCount, FriendsAmountInvested & Occupation.


Multivariate_Plot_a compares ‘CurrentDelinquencies’ along the x axis with ‘CreditScoreMean’ along the y axis.

The value of the HomeMortgage which can either be true or false is coded in color.

In this plot I am looking to see if having a HomeMortgage is related to CurrentDelinquencies and CreditScoreMean.

Because the red points fall mainly lower and farther to the right, borrowers with home mortgages tend to have better CreditScoreMeans and lower variance in the number of loan payment delinquencies.

Multivariate_Plot_b compares ‘CreditScoreMean’ along the x axis with ‘CurrentDelinquencies’ along the y axis.

In this plot however, we want to know if having a verified income with the lender is related to delinquent loan payments and or the borrower’s credit score mean.

The low percentage of red points compared to green points indicates very few loans are made to borrowers without a verified income.

For those loans that have been made to borrowers without a verified income the data suggests that few loans made to this group have the lowest level of payment delinquencies.

Very few of the red points lie along the x axis compared with the green points.

Multivariate_Plot_c compares ‘Term’ along the x axis with ‘CurrentDelinquencies’ along the y axis.

It looks as though from this plot, that nearly all loans with a 60 month term are going to borrowers with a home mortgage and the delinquency rate among those loans is relatively low.

On the other hand, loans with a 12 month term appear to be almost evenly distributed among the borrowers who have a home mortgage and borrowers who don’t have a home mortgage.

However, for the most prevalent loan term (the 36 month term), the data suggests that the number of payment delinquencies by non mortgage holders is far greater than the number of payment delinquencies by the borrowers who have a home mortgage.

Moreover, current delinquencies (red points) are more prevalent in the 36 month term category, as can be seen in the higher relative number of red points to green points.

Multivariate_Plot_d, compares ‘CurrentDelinquencies’ along the x axis with ‘LoanCode’ along the y axis.

I’ve added ‘IncomeRange_ordered’ using color visualizing the relationship of ‘IncomeRange_ordered’ with loan purpose listed as ‘LoanCode’ and current payment delinquencies listed as ‘CurrentDelinquencies’.

The conclusion I draw from this plot is that current payment delinquencies are fairly evenly distributed across each of the income groupings. Other than that, a few categories of loans, such as “Green Loans” and “Boat Loans”, have fewer payment delinquencies than most of the others; however, this is most likely due to fewer of these loans being made.

Multivariate_Plot_e compares ‘DebtToIncomeRatio’ along the x axis with ‘CurrentDelinquencies’ along the y axis.

‘IncomeRange_ordered’ is mapped to color to stratify the current payment delinquencies by income group. An initial observation is that this lender is reluctant to approve loans with a DTI above roughly 35%, although some loans are made over that level.

It appears that the number of current delinquencies is fairly evenly distributed among the various income range groupings.

Multivariate_Plot_f compares ‘BorrowerAPR’ along the x axis with ‘CurrentDelinquencies’ along the y axis.

The plot indicates that current delinquencies are roughly normally distributed across ‘BorrowerAPR’ for the 36 month loan term, shown in green.

It is less obvious what the distribution is for the 12, and 60 month loan terms.

Multivariate_Plot_g compares ‘BorrowerAPR’ along the x axis with ‘DebtToIncomeRatio’ along the y axis.

To obtain a better visualization, the upper 1% of values (outliers) were removed by subsetting the current data.

From this plot we can confirm that the mean ‘BorrowerAPR’ is roughly .22 and that the mean DTI is roughly .28.

We can also see that the distribution of each variable plotted against the other is normally shaped. Finally, the distribution with regard to the ‘Term’ appears to be nearly identical for each of the three term lengths.
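Removing the upper 1% of values, as described for Multivariate_Plot_g, can be sketched with a quantile cutoff. The vector `dti` is made up for illustration, and which variable or variables were actually trimmed in the original code is not shown in this report.

```r
# Hypothetical debt-to-income ratios with one extreme value.
dti <- c(seq(0.05, 0.60, length.out = 199), 10.01)

# 99th percentile cutoff; values above it are treated as outliers here.
cutoff  <- quantile(dti, probs = 0.99, na.rm = TRUE)
trimmed <- dti[!is.na(dti) & dti <= cutoff]

length(dti) - length(trimmed)   # number of values removed
max(trimmed)
```

Trimming only changes what is displayed; the removed rows remain in the underlying data frame unless the subset is assigned back to it.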

Multivariate_Plot_h compares ‘CurrentDelinquencies’ along the x axis with ‘IncomeRange_ordered’ along the y axis using color for the ‘Term’ variable.

This plot shows that the number of current delinquencies is fairly evenly distributed between the upper four income ranges.

Taking a second look at the summary for the ‘IncomeRange_ordered’ variable, we see that only 621 loans were made to borrowers in the lowest income category and that only 7274 loans out of more than 110,000 were made to borrowers in the $1-24,999 income range, which is about 6.4 percent of total loans.

Multivariate_Plot_i compares ‘CurrentDelinquencies’ along the x axis with ‘FriendsAmountInvested’ along the y axis using color for the ‘CreditScoreMean’ variable.

As ‘FriendsAmountInvested’ increases, ‘CurrentDelinquencies’ decrease.

This is perhaps one of the fundamental principles in how peer funding works.

By having peers involved in the loan process, on-time repayments increase.

Multivariate_Plot_j compares ‘CreditScoreMean’ along the x axis with ‘FriendsAmountInvested’ along the y axis using color for the ‘FriendsAmountInvested’ variable.

The frequency of ‘FriendsAmountInvested’ values appears to show no difference between borrowers with and without a home mortgage.

Multivariate_Plot_k compares ‘CurrentDelinquencies’ along the x axis with ‘Occupation’ along the y axis using color for the ‘HomeMortgage’ variable.

From this plot we can visually see that the borrowers without a home mortgage appear to have a greater frequency of current delinquencies compared to the borrowers who have a home mortgage.

This is evidenced by the green points coalescing on the left along the y axis and the red points gravitating toward the right side along the y axis.





Linear Models

The R-squared value of the linear model is displayed below.

## Model R-squared value =  0.2425937










Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Through this analysis I’ve discovered that the ‘CreditScoreMean’ and ‘CurrentDelinquencies’ are inversely related.

Were there any interesting or surprising interactions between features?

Through the analysis I’ve also discovered that, on average, borrowers who have a ‘HomeMortgage’ have fewer ‘CurrentDelinquencies’, and that the higher the borrower’s ‘DebtToIncomeRatio’ value, the higher the ‘BorrowerAPR’ / annual effective interest rate.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a linear model that attained an R-squared value of 0.243, meaning it explains only about 24 percent of the variance in the response, which isn’t too good.
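For readers unfamiliar with how such a value is obtained, the sketch below fits a linear model with `lm` and extracts R-squared. Everything here is an assumption for illustration: the data are simulated, and the two predictors are chosen only because they showed the strongest correlations earlier; the report does not state which predictors the actual model used.

```r
# Synthetic data only: delinquencies loosely decreasing with credit score.
set.seed(1)
n <- 200
credit_score_mean <- runif(n, min = 500, max = 850)
overdue_last_7y   <- rpois(n, lambda = 2)
current_delinquencies <- pmax(
  0, 6 - 0.007 * credit_score_mean + 0.3 * overdue_last_7y + rnorm(n)
)
sim <- data.frame(current_delinquencies, credit_score_mean, overdue_last_7y)

# Ordinary least squares fit with two predictors.
fit <- lm(current_delinquencies ~ credit_score_mean + overdue_last_7y,
          data = sim)

# Proportion of variance in the response explained by the model.
r_squared <- summary(fit)$r.squared
cat("Model R-squared value = ", r_squared, "\n")
```

A count-like response such as the number of delinquencies is a limitation for ordinary least squares, which may be one reason a simple linear model explains relatively little of the variance.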


Final Plots and Summary

Plot One

Description One

Final Plot 1 shows the relationship between the dependent or response variable ‘CurrentDelinquencies’ and the independent variable ‘CreditScoreMean’, which was divided into six levels with the cut function.

A regression line was plotted through the points to show that the two variables are inversely related.
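Dividing a numeric variable into six levels with `cut` can be sketched as follows. The scores and the assumed 500 to 900 range with equal-width intervals are illustrative; the break points used for the actual plot are not given in this report.

```r
# Hypothetical credit score means.
credit_score_mean <- c(520, 610, 669.5, 689.5, 729.5, 790, 860)

# Six equal-width intervals spanning an assumed 500-900 range.
score_level <- cut(credit_score_mean,
                   breaks = seq(500, 900, length.out = 7),
                   include.lowest = TRUE)

nlevels(score_level)
table(score_level)
```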

Plot Two

Description Two

Final Plot 2 shows the relationship between the dependent variable ‘CurrentDelinquencies’ which was plotted on the x axis instead of the y axis for ease of labeling of the tick labels.

Color was added to this plot to show the interrelationship of the ‘CreditScoreMean’ variable. This plot clearly shows that the lower credit scores have the greatest number of payment delinquencies.

Plot Three

Description Three

Final Plot three shows the relationship between ‘LoanCode’ which are the loan categories on the y axis and ‘CurrentDelinquencies’ on the x axis.

The ‘CreditScoreMean’ variable was cut into seven groupings and used interactively in this plot.

This plot demonstrates the major significance of the ‘CreditScoreMean’, or of most credit rating systems in general.

Here we can see that very few delinquencies show up for the lowest credit scores regardless of loan purpose.

This is most likely because lenders are not lending that much to borrowers with very low credit scores.

On the other hand, we can also see that the borrowers with the highest credit scores tend to have the fewest delinquent payments regardless of the purpose of the loans.

We can also see that certain categories of loans, like boat loans and RV loans, are not as prevalent.






Reflection

Conclusions:

This was a challenging project in a number of ways.

First of all, the data set was rather large in both the number of rows and columns.

I found that one of the greatest challenges for me was to get the coding right for making the plots.

Once I could accomplish that task I began to see the shortcomings in the data and had to start making adjustments and modifications to get the plots to work.

I think learning to utilize the ggplot2 package and a number of other packages made the task much easier than it would have been without them.

Possible improvements for this project would incorporate model predictions and prediction accuracy calculations.






References:

Session Info

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] xtable_1.8-3       sessioninfo_1.1.1  colorspace_1.4-0  
##  [4] RColorBrewer_1.1-2 reshape2_1.4.3     reshape_0.8.8     
##  [7] quantmod_0.4-13    TTR_0.23-4         xts_0.11-2        
## [10] zoo_1.8-4          purrr_0.3.1        car_3.0-2         
## [13] carData_3.0-2      plotly_4.8.0       leaflet_2.0.2     
## [16] lubridate_1.7.4    memisc_0.99.14.12  MASS_7.3-51.1     
## [19] lattice_0.20-38    scales_1.0.0       gridExtra_2.3     
## [22] ggpubr_0.2         GGally_1.4.0       ggthemes_4.1.0    
## [25] ggplot2_3.1.0      testthat_2.0.1     hms_0.4.2         
## [28] forcats_0.4.0      tibble_2.0.1       broom_0.5.1       
## [31] data.table_1.12.0  dygraphs_1.1.1.6   dplyr_0.8.0.1     
## [34] plyr_1.8.4         stringr_1.4.0      magrittr_1.5      
## [37] readxl_1.3.0       readr_1.3.1        tidyr_0.8.3       
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.0        jsonlite_1.6      viridisLite_0.3.0
##  [4] shiny_1.2.0       assertthat_0.2.0  cellranger_1.1.0 
##  [7] yaml_2.2.0        pillar_1.3.1      backports_1.1.3  
## [10] glue_1.3.0        digest_0.6.18     promises_1.0.1   
## [13] htmltools_0.3.6   httpuv_1.4.5.1    pkgconfig_2.0.2  
## [16] haven_2.1.0       openxlsx_4.1.0    later_0.8.0      
## [19] rio_0.5.16        generics_0.0.2    withr_2.1.2      
## [22] repr_0.19.2       lazyeval_0.2.1    cli_1.0.1        
## [25] crayon_1.3.4      mime_0.6          evaluate_0.13    
## [28] nlme_3.1-137      foreign_0.8-71    tools_3.5.2      
## [31] munsell_0.5.0     zip_1.0.0         compiler_3.5.2   
## [34] rlang_0.3.1       grid_3.5.2        htmlwidgets_1.3  
## [37] crosstalk_1.0.0   labeling_0.3      base64enc_0.1-3  
## [40] rmarkdown_1.11    codetools_0.2-16  gtable_0.2.0     
## [43] abind_1.4-5       curl_3.3          R6_2.4.0         
## [46] knitr_1.21        stringi_1.3.1     Rcpp_1.0.0       
## [49] tidyselect_0.2.5  xfun_0.5
## [1] "Fri Mar  8 12:57:25 2019"